UWF IDC6940 Capstone Project
2025-04-23
Linear Mixed Models (LMMs) are statistical models that extend traditional linear regression by including both:
Fixed effects: These are the overall effects you’re interested in studying (e.g., how testing volume affects case counts across all countries).
Random effects: These account for variation within subgroups or clusters (e.g., differences between countries), allowing intercepts and/or slopes to vary by group.
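LMMs are especially useful when observations are grouped or repeated within units, as with the daily records per country analyzed below, so that observations within the same group cannot be treated as independent.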
A linear mixed model with one predictor, a random intercept, and a random slope has the form:
\[ y_{ij} = \beta_0 + \beta_1 x_{ij} + b_{0j} + b_{1j}x_{ij} + \varepsilon_{ij} \]
where
\(y_{ij}\): response variable for observation \(i\) in group \(j\)
\(x_{ij}\): predictor (e.g., scaled daily tests)
\(\beta_0,\beta_1\): fixed intercept and slope (population-level)
\(b_{0j}, b_{1j}\): random intercept and slope for group \(j\) (e.g., country)
\(\varepsilon_{ij}\): residual error, assumed to be normally distributed: \(\varepsilon_{ij} \sim N (0, \sigma^2)\)
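As a minimal sketch of what this equation implies, the base R snippet below simulates data from it. The group count, all numeric values, and the object names (n_groups, Sigma, sim) are illustrative assumptions, not estimates from this project. The random effects are assumed to be bivariate normal with mean zero, a standard assumption that the definitions above leave implicit.

```r
set.seed(1)  # for reproducibility

n_groups <- 5    # number of groups j (e.g., countries); illustrative
n_obs    <- 20   # observations i per group; illustrative

beta0 <- 100     # fixed intercept (made-up value)
beta1 <- 50      # fixed slope (made-up value)
sigma <- 10      # residual standard deviation (made-up value)

# Covariance matrix of (b_0j, b_1j); made-up values
Sigma <- matrix(c(400, -30,
                  -30,  25), nrow = 2, byrow = TRUE)

# Draw correlated random effects: rows of Z0 are independent N(0, I),
# and multiplying by chol(Sigma) gives rows with covariance Sigma
Z0 <- matrix(rnorm(n_groups * 2), ncol = 2)
b  <- Z0 %*% chol(Sigma)   # column 1: b_0j, column 2: b_1j

group <- rep(seq_len(n_groups), each = n_obs)
x     <- runif(n_groups * n_obs, min = 0, max = 10)
eps   <- rnorm(n_groups * n_obs, mean = 0, sd = sigma)

# y_ij = beta0 + beta1 * x_ij + b_0j + b_1j * x_ij + eps_ij
y <- beta0 + beta1 * x + b[group, 1] + b[group, 2] * x + eps

sim <- data.frame(group = factor(group), x = x, y = y)
head(sim)
```

Each group gets its own line with intercept beta0 + b_0j and slope beta1 + b_1j, which is the source of the between-group variation that the random effects are meant to capture.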
In matrix form, the LMM can be represented as:
\[ y = X\beta + Zb + \varepsilon \]
where
\(y\): vector of outcomes
\(X\): fixed-effects design matrix
\(\beta\): vector of fixed effect coefficients
\(Z\): random-effects design matrix
\(b\): vector of random effects, assumed \(\sim N (0, G)\)
\(\varepsilon\): residual errors, assumed \(\sim N(0, R)\)
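To make the matrix form concrete, the sketch below builds \(X\) and \(Z\) in base R for a made-up data set with two groups and three observations each. It then forms the marginal covariance of \(y\), \(V = ZGZ^\top + R\), which follows from the two normality assumptions above when \(b\) and \(\varepsilon\) are independent. The toy data, the covariance values, and the object names are assumptions for illustration only.

```r
# Toy data: 2 groups with 3 observations each (values are made up)
toy <- data.frame(
  group = factor(rep(c("A", "B"), each = 3)),
  x     = c(1, 2, 3, 1, 2, 3)
)

# Fixed-effects design matrix X: intercept column and predictor column
X <- model.matrix(~ x, data = toy)

# Random-effects design matrix Z: one intercept column and one slope column per group
Z_int   <- model.matrix(~ 0 + group, data = toy)  # group indicator columns
Z_slope <- Z_int * toy$x                          # indicators multiplied by x
Z <- cbind(Z_int, Z_slope)
colnames(Z) <- c("int_A", "int_B", "slope_A", "slope_B")

# Covariance of (intercept, slope) within one group, and residual variance (made up)
Sigma  <- matrix(c(4, -1,
                   -1, 1), nrow = 2, byrow = TRUE)
sigma2 <- 2

# b is ordered as (b_0A, b_0B, b_1A, b_1B), so G = Sigma (x) I_2
G     <- kronecker(Sigma, diag(2))
R_mat <- sigma2 * diag(nrow(toy))   # independent residuals with equal variance

# Marginal covariance of y
V <- Z %*% G %*% t(Z) + R_mat
round(V, 2)
```

The off-diagonal blocks of V are zero because observations from different groups share no random effects, while observations within a group are correlated through their common intercept and slope.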
| Date | Country_Region | Province_State | positive | active | hospitalized | hospitalizedCurr | recovered | death | total_tested | daily_tested | daily_positive |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2020-01-16 | Iceland | All States | 3 | NA | NA | NA | NA | NA | NA | NA | NA |
| 2020-01-17 | Iceland | All States | 4 | NA | NA | NA | NA | NA | NA | NA | 1 |
| 2020-01-18 | Iceland | All States | 7 | NA | NA | NA | NA | NA | NA | NA | 3 |
| 2020-01-20 | South Korea | All States | 1 | NA | NA | NA | NA | NA | 4 | NA | NA |
| 2020-01-22 | United States | All States | 0 | NA | NA | NA | NA | 0 | 0 | NA | NA |
| 2020-01-22 | United States | Massachusetts | 0 | NA | NA | NA | NA | 0 | 0 | NA | NA |
This study uses linear mixed models to examine the relationship between daily testing volume and the number of daily COVID-19 positive cases reported across countries over time. The data set consists of repeated daily observations for each country, which produces a multilevel structure: each country is a group with its own series of observations. This hierarchical structure makes it possible to account for both the overall trends in the data and the variability across countries.
The model includes a fixed effect for the number of daily tests conducted, which can influence case detection, and random effects that capture country-specific deviations in testing practices and positivity rates. Both the intercept (baseline positivity) and the slope (the effect of testing on reported cases) are allowed to vary by country.
library(dplyr) #provides %>% and filter()
df <- COVID19
#Convert Date to Date format
df$Date <- as.Date(df$Date)
#Remove rows with missing values in relevant columns
df_clean <- df %>%
  filter(!is.na(daily_positive), !is.na(daily_tested), !is.na(Country_Region))
#Rescale daily_tested to thousands of tests to improve interpretation and convergence
df_clean$daily_tested_scaled <- df_clean$daily_tested / 1000
Key components of the model
The model was specified in R with lmer() from the lme4 package as:
lmer(daily_positive ~ daily_tested_scaled + (daily_tested_scaled | Country_Region), data = df_clean)
#Extract and print key fixed effects
fixef(model_refined) #estimates for population-level intercept and slope
(Intercept) daily_tested_scaled
   95.05436            52.16033
 Groups         Name                Std.Dev.  Corr
 Country_Region (Intercept)          659.941
                daily_tested_scaled   64.935  -0.423
 Residual                           1725.488
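One way to read these estimates together is sketched below in base R. The numbers are copied from the output above. The 95% range relies on the model's assumption that country-specific slopes are normally distributed around the fixed slope, and the object names are illustrative.

```r
# Estimates copied from the model output above
beta1_hat <- 52.16033   # fixed slope: extra daily positives per 1,000 extra daily tests
sd_slope  <- 64.935     # between-country SD of the slope
sd_int    <- 659.941    # between-country SD of the intercept
corr_is   <- -0.423     # correlation between random intercept and random slope

# Approximate range covering 95% of country-specific slopes,
# assuming normally distributed random slopes
slope_range <- beta1_hat + c(-1, 1) * qnorm(0.975) * sd_slope
round(slope_range, 1)   # -75.1 179.4

# Covariance between random intercept and slope implied by the correlation
cov_is <- corr_is * sd_int * sd_slope
cov_is
```

Under these assumptions the range spans zero, so the fitted model is compatible with some countries showing little or no positive association between tests and reported positives. The negative correlation indicates that countries with higher baseline counts tend to have flatter slopes.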
Visualizations of fitted model predictions against observed data showed that:
The relationship between testing and positivity varies across countries.
Some countries exhibit steeper slopes (e.g., South Korea), while others show flatter or more variable trends (e.g., United States).
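This supports the use of a random slope model, as a single global slope would fail to capture such heterogeneity. The fitted model points the same way: the between-country standard deviation of the slope (64.9) is larger than the fixed slope itself (52.2).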
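A complementary, more formal check is a likelihood ratio test comparing a random-intercept model with the random-slope model. The sketch below shows the mechanics with lme4 on simulated data. The simulated values and the object names (sim, m_int, m_slope) are assumptions for illustration and are not results from the COVID-19 data.

```r
library(lme4)

set.seed(42)  # for reproducibility

# Simulated grouped data with a true random slope (all values are made up)
n_groups <- 30
n_obs    <- 40
group <- rep(seq_len(n_groups), each = n_obs)
x     <- runif(n_groups * n_obs, min = 0, max = 10)

b0 <- rnorm(n_groups, mean = 0, sd = 20)  # random intercepts
b1 <- rnorm(n_groups, mean = 0, sd = 5)   # random slopes (independent of b0 for simplicity)
y  <- 100 + 50 * x + b0[group] + b1[group] * x +
  rnorm(n_groups * n_obs, mean = 0, sd = 10)

sim <- data.frame(group = factor(group), x = x, y = y)

# Fit both models by maximum likelihood so their likelihoods are comparable
m_int   <- lmer(y ~ x + (1 | group), data = sim, REML = FALSE)
m_slope <- lmer(y ~ x + (x | group), data = sim, REML = FALSE)

fixef(m_slope)     # population-level intercept and slope
VarCorr(m_slope)   # random-effect standard deviations and correlation

# Likelihood ratio test: does adding the random slope improve the fit?
lrt <- anova(m_int, m_slope)
lrt
```

Applied to the project data, the same comparison would use daily_positive, daily_tested_scaled, and Country_Region in place of y, x, and group. Because the null hypothesis places a variance on the boundary of its parameter space, the p-value from this test tends to be conservative.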